minimally competent candidate will answer correctly each item comprising the
test. In many cases, these item performance estimates are made twice, with 
information shared with SMEs between estimates. This estimation process can 
be time-consuming and fatiguing, especially for the SMEs. This study 
addressed the possibilities of saving time and resources in an Angoff 
standard setting study. The study used three sets of databases, two from a 
medical health certification program, with 10 judges for each data set, and
one from a financial analyst certification study, with 33 judges. The study 
shows that using subsets of items, as opposed to the full length test, could 
be an important consideration that could reduce the time and resources 
necessary to conduct a standard setting study. The results of this study 
suggest that 50% of test items may be sufficient to estimate an equivalent 
minimum passing score in an Angoff setting study. This could result in a 
substantial saving of time and resources not only for the agency that has 
been carrying out this activity, but also to the practitioners (panelists) 
who participate in the standard setting study. (Author/SLD) 
Abstract

In an Angoff standard setting procedure, subject matter experts (SMEs) estimate the probability 
that a hypothetical randomly selected minimally competent candidate (MCC) will answer 
correctly each item comprising the test. In many cases, these item performance estimates are 
made twice, with information shared with the SMEs between estimates. Especially for long tests 
this estimation process can be time consuming and fatiguing for the SMEs. This study addressed 
the possibilities of saving time and resources in an Angoff standard setting study. This study 
showed that using subsets of items as opposed to the full-length test could be an important 
consideration that could reduce the time and resources necessary to conduct a standard setting 
study. The results of this study suggest that 50 percent of test items may be sufficient to estimate
an equivalent minimum passing score (MPS) in an Angoff standard setting study. This could result in
a substantial saving of time and resources, not only for the agency carrying out this activity but
also for the practitioners (panelists) who participate in the standard setting study.
The Use of Subsets of Test Questions in an Angoff Standard Setting Method

Background 

The Angoff (1971) method is one of the most prominent and widely used test-centered 
standard setting procedures. In this method, a panel of judges is used to set minimum passing
scores (MPSs). These judges are considered to be experts in the content domain being assessed (Jaeger,
1991). The judges are asked to conceptualize a randomly drawn minimally competent candidate 
(MCC) and to estimate the probability that the MCC will correctly answer each of the items in 
the test. These item performance estimates are summed across the items in the test, to yield an 
individual judge’s MPS. These individual MPSs are then averaged to estimate a recommended 
MPS. 
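
The aggregation described above (sum each judge's item estimates, then average across judges) can be sketched as follows. This is a minimal illustration, not the authors' software; the function name `angoff_mps` and the small ratings matrix are hypothetical.

```python
from statistics import mean

def angoff_mps(ratings):
    """Return (panel MPS, list of per-judge MPSs).

    ratings[j][i] is judge j's estimate of the probability that a
    minimally competent candidate answers item i correctly.
    """
    # Each judge's MPS is the sum of that judge's item estimates.
    judge_mps = [sum(judge) for judge in ratings]
    # The recommended MPS is the mean of the judges' MPSs.
    return mean(judge_mps), judge_mps

# Hypothetical ratings: 3 judges, 4 items (not data from the study).
ratings = [
    [0.60, 0.75, 0.90, 0.50],
    [0.70, 0.80, 0.85, 0.55],
    [0.65, 0.70, 0.95, 0.60],
]
panel_mps, judge_mps = angoff_mps(ratings)
print(round(panel_mps, 2), [round(m, 2) for m in judge_mps])
```

With real data, the matrix would have one row per judge and one column per operational item, using the second-iteration estimates.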

In most cases, the recommended MPS is derived over a number of iterations of the Angoff
procedure; usually the number of iterations is two. In the first iteration, an initial estimate of the
MPS is derived; then the judges are provided with data either about candidate performance
(p-values and/or impact data) or about initial MPS values for the panel members (Reckase, 2001). In the
second iteration, the judges are asked to re-estimate the proportions of MCCs who would answer
each item correctly. Based on the judges’ revised estimates, the recommended MPS is derived
using the same procedures used in the first iteration. The second set of estimates is considered
to be better informed and therefore to lead to more defensible standards, because many sources of
error due to judges’ misunderstanding, carelessness, inconsistencies, and mistakes are removed
from their estimates (Hambleton, 1998).

Based on the above description of the Angoff method, it is evident that the recommended
MPS is often derived in two iterations from the judges’ estimates of the MCC’s performance
on each and every item in the test. If the recommended MPS could be estimated using a subset of
the test items as opposed to the full-length test, a saving in time and resources could be realized
for the agency carrying out this activity. Moreover, the judges who participate in
setting a passing score are often practitioners in their profession who may have to
miss business opportunities or close their offices when they participate in the standard setting
study. Therefore, a longer standard setting process is costly not only for the
individual panelists but also for the organizations they are associated with. However, it must be
demonstrated that equivalent results would occur if only a subset of the full test were used in the
standard setting study.

Few studies have been carried out concerning cost reduction in a standard setting study.
Harvey and Way (1999) developed a web-based standard setting system to
offset the costs of travel of the judges to a central location. The results of the study suggested
that the recommended MPS from an Internet study would be similar to that from a monitored
on-site study. Sireci, Patelis, Rizavi, Dillingham, and Rodriguez (2000) showed that the MPS values
derived using only two-thirds of the items composing a CAT item pool were very similar to
cut scores derived using the entire item bank. However, the study involved only a single panel of
experts and evaluated the method using a single test.

The purpose of this study was to compare the MPS values estimated using a variety of 
subsets of items as opposed to the full-length test in certification examinations. 

Methods 

Data 

This study used three sets of databases, two from a medical health certification program
and one from a financial analyst certification program. Two separate standard setting studies
were conducted for the medical health program, one in 1995 and the second in 2000. The
examination used in each of these studies consisted of 110 operational items, but there
were no common items across the two examinations. In both studies, panels of 10 judges
participated, but there were no common panelists across the two studies. The estimates of the
MPS were obtained in two iterations of the Angoff standard setting method. Only the
performance estimates provided in iteration two were used to calculate the estimated MPS
values.

The financial analyst study was conducted in 2001 and consisted of a total of 230 items. 
A panel of 33 judges participated in the study. The panelists were randomly divided into two 
groups. The two panels (A and B) each looked at all 230 items in the test. The analysis from 
panel A was cross validated with the analysis from panel B. Estimates of the MPS were obtained 
in two iterations. The data from iteration two was used in this study. 

Both agencies developed their tests based on tables of specifications designed to 
represent the content categories for their certification area. For the medical health examination 
there were a total of six categories. The financial analyst examination also had six content 
categories. Each content category was weighted by the agency based on its importance to the
respective certification decision. The proportion of items in each content category comprising
each test was consistent with these weightings.

Formation of Subsets 

Before forming the subsets, the test items were grouped into ten categories based on the
difficulty level of the items. Items with a difficulty level of 0-0.10 constituted category 1,
items with a difficulty level of 0.11-0.20 constituted category 2, and so on, up to items with a
difficulty level of 0.90-1.00, which comprised category 10.

In this study, we extracted 5%, 10%, 20%, 30%, 40%, 50%, 60%, and 70% of the total
items to constitute the subsets. To select the items from the full-length test, a stratified random
sampling technique was used, with the item difficulty level categories as the strata. Items
were selected randomly, in proportion to the number of items in the full-length test that
appeared in each of the ten strata. The same approach was used to create the subsets of items for
all four of the data sets: two for medical health (1995 and 2000) and two for financial analyst
(Panel A and Panel B data).
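
A proportional stratified draw of this kind might be sketched as below. The paper does not give its selection code or rounding rule, so the helper names (`difficulty_stratum`, `stratified_sample`), the assumption that p-values are reported to two decimals, and the per-stratum rounding are all illustrative choices; the p-values themselves are synthetic.

```python
import random
from collections import defaultdict

def difficulty_stratum(p_value):
    """Map a p-value to a difficulty category 1..10 (0-0.10 -> 1, 0.11-0.20 -> 2, ...).

    Assumes p-values are reported to two decimals, so integer percent
    arithmetic avoids floating-point surprises at category boundaries.
    """
    pct = round(p_value * 100)
    return max(1, (pct + 9) // 10)

def stratified_sample(p_values, fraction, rng):
    """Draw `fraction` of the items from each difficulty stratum at random."""
    strata = defaultdict(list)
    for item_id, p in enumerate(p_values):
        strata[difficulty_stratum(p)].append(item_id)
    chosen = []
    for stratum in sorted(strata):
        items = strata[stratum]
        # Assumed rule: round each stratum's share to the nearest integer.
        # Note that Python's round() rounds exact halves to the even integer.
        k = min(len(items), round(len(items) * fraction))
        chosen.extend(rng.sample(items, k))
    return sorted(chosen)

# Synthetic 100-item test: 20 items at p = 0.55, 40 at 0.75, 40 at 0.95.
p_values = [0.55] * 20 + [0.75] * 40 + [0.95] * 40
subset = stratified_sample(p_values, 0.5, random.Random(1995))
print(len(subset))
```

Because each stratum is rounded separately, the total subset size need not equal the nominal percentage of the test exactly, which may be one reason the item counts in the tables (for example, 21 items for the 20% subset of 110) are not exact multiples.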

Data Analysis Plan 

The MPS for the full-length tests (certification examinations for medical health and 
financial analyst) had been estimated through an Angoff method, in which the panelists’ second 
set of item performance estimates were summed across the items in the test, then these individual 
MPSs were averaged to estimate an overall MPS. The MPS values for the subsets were estimated 
in the same way as MPSs for the full-length test had been estimated. These MPSs were 
compared for equivalence. For the medical health studies, the MPSs based on the eight subsets 
were compared to the MPSs of the full-length tests from the 1995 and 2000 standard setting 
studies, respectively. For the financial analyst study, the eight subset MPS values were compared to
the full-length test MPS for each panel, A and B, allowing for a cross-validation
of the results. The obtained MPSs for the subsets and the full-length test were then compared using
a one-way ANOVA to determine whether there were any statistically significant differences
between the MPSs.
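
The comparison step can be illustrated with a hand-computed one-way ANOVA F statistic. Two things here are assumptions rather than details from the paper: that a subset MPS is placed on the full-length scale by multiplying the mean subset rating by the number of items in the full test (the tables report subset MPSs on that scale, but the text does not say how), and that the ANOVA observations are the individual judges' MPSs. The ratings are hypothetical.

```python
from statistics import mean

def scaled_judge_mps(ratings, item_ids, full_length):
    """Per-judge MPS from a subset of items, projected onto the full-length scale.

    Assumption: projection = (mean rating over the subset) * (items in full test).
    """
    return [full_length * mean(judge[i] for i in item_ids) for judge in ratings]

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA over the groups."""
    values = [x for group in groups for x in group]
    grand_mean = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical ratings: 3 judges, 6 items (not data from the study).
ratings = [
    [0.60, 0.75, 0.90, 0.50, 0.80, 0.70],
    [0.70, 0.80, 0.85, 0.55, 0.75, 0.65],
    [0.65, 0.70, 0.95, 0.60, 0.85, 0.75],
]
full = scaled_judge_mps(ratings, range(6), 6)
half = scaled_judge_mps(ratings, [0, 2, 4], 6)
f_stat, df_b, df_w = one_way_anova_f([full, half])
print(f"F({df_b}, {df_w}) = {f_stat:.3f}")
```

The F value would then be compared with the critical value for the given degrees of freedom at the chosen significance level. Since the same judges contribute to every group, a repeated-measures design might be more appropriate; the sketch follows the one-way layout named in the text.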

As a follow-up analysis, repeated samples were drawn for the relevant subsets to examine
the pattern of the results. Finally, the same number of items as in the subsets of interest
was also selected using simple random sampling to examine whether the results obtained were
dependent on the item selection method. The purpose of these follow-up analyses was to provide
substantive information relevant to the item selection technique.

Results 

Medical Health Studies 

The results from the Medical Health Study 1995 (Table 1) showed that the average
difficulty level of the full-length test was 83. This was determined by summing the p-values of
the items that comprised the test. For the subsets, which were shorter than the full-length test,
the average difficulties were found to be comparable to that of the full-length test. Using the
subset of 50% of the items, the average difficulty level (82) differed by only 1 point from the average
of the full-length test. However, a maximum difference of 3 points was found for the subsets of
5% and 10% of the items.

The average difficulty level of the full-length test for the Medical Health Study 2000 was
91 (Table 2). Using the subsets of 40% and 50% of the items, the average difficulty level (90) differed
by only 1 point from the average of the full-length test. The maximum difference of 3 points was
found for the subset of 5% of the items.

The Medical Health Studies of 1995 and 2000 showed that the MPSs using the full-length test
were 72 and 86, respectively (Table 1 and Table 2). These same MPS values were also obtained
when using 40% and 50% of the test items in the respective studies. For both data sets (1995 and
2000), the remaining subsets, containing 5%, 10%, and 30% of the items, produced MPS
estimates that had a maximum difference of 3 points, however. None of the differences were found
to be statistically significant (p > 0.05). The standard error of measurement (SEM) was also calculated
for both data sets to assess the stability of the estimated MPSs. The SEMs were found to be 2.40
for Medical Health 1995 and 1.33 for 2000. To be more conservative, we rounded them down to the
nearest integer values, i.e., 2 for the 1995 and 1 for the 2000 Medical Health study. The estimates of
the MPS for both studies obtained using 50% of the items fell within one SEM of the MPSs of the
full-length tests.

Financial Analyst Study 

The results from panel A of the Financial Analyst Study 2001 showed that the average
difficulty level of the full-length test was 144, which was also obtained from the subsets of 30%,
40%, 50%, 60%, and 70% of the items. The maximum difference of 7 points was found for the subset
of 5% of the items (Table 3).

The average difficulty level of the full-length test for panel B of the Financial Analyst Study
2001 was 144 (Table 4). Using the subset of 50% of the items, the average difficulty level (145)
differed by only 1 point from the average of the full-length test. The maximum difference of 7
points was obtained for the subset of 5% of the items. With subsets of 20%, 30%, 40%, 60%, and 70%
of the items, the average difficulty level was exactly the same as for the full-length test.

The results for the Financial Analyst Study 2001 for panel A and panel B (Table 3 and
Table 4) showed that the MPSs using the full-length test were 157 and 143, respectively. Using
50% of the test items, the estimated MPS values were 157 for panel A and 145
for panel B, which were zero points and 2 points from the MPSs of the full-length tests,
respectively. The estimates obtained from the remaining subsets (5%, 10%, 20%, 30%, and 40%
of the test items) for panels A and B had maximum differences of 27 points and 5 points from the
MPSs of the full-length test, respectively. However, only the difference for the 5% subset for Panel
A was found to be statistically significant (p < 0.05). The standard errors of measurement
were found to be 3.69 for panel A and 3.28 for panel B. Rounded down to the nearest
integer, the SEM was 3 for both panels. The estimated MPSs using 50% of the test items for both
panels were within one SEM of the MPSs of the full-length test.
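
The "within one SEM" check reported for the four data sets is simple arithmetic and can be reproduced from the figures given in the text and Tables 1 through 4. The sketch below only re-applies the stated rule (SEM rounded down to an integer); the function name `within_one_sem` is hypothetical, and how the SEMs themselves were computed is not specified in the paper.

```python
import math

# (full-length MPS, 50%-subset MPS, reported SEM), as given in the text and tables.
studies = {
    "Medical Health 1995": (72, 73, 2.40),
    "Medical Health 2000": (86, 86, 1.33),
    "Financial Analyst, Panel A": (157, 157, 3.69),
    "Financial Analyst, Panel B": (143, 145, 3.28),
}

def within_one_sem(full_mps, subset_mps, sem):
    """True if the subset MPS is at most floor(SEM) points from the full-length MPS."""
    return abs(subset_mps - full_mps) <= math.floor(sem)

results = {name: within_one_sem(*values) for name, values in studies.items()}
for name, ok in results.items():
    print(f"{name}: within one SEM = {ok}")
```

Rounding the SEM down makes the criterion stricter, which matches the conservative intent described above.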

Follow-up Analysis 

The estimated MPS values and the average difficulty levels based on the stratified random
sampling of 50% of the test questions yielded promising results for both the Medical Health and
Financial Analyst analyses. These values were consistently equal to, or at most two points from, the
full-length test values. As a follow-up analysis, two repeated stratified samples of 50%
were generated to examine the stability of the results. For the Medical Health 1995 data, the two
repeated stratified samples had an average difficulty level of 82, which was within 1 point
of the average of the full-length test (83), and yielded MPS estimates of 71 and 72
compared to 72 for the full-length test (Table 5). For the 2000 data, both repeated stratified
samples resulted in estimated MPSs of 85 compared to 86 from the full-length test and had an
average difficulty level of 90, which was within 1 point of the average of the full-length test
(Table 5). For the Financial Analyst data, the repeated stratified random samples of 50% of the items had
an average difficulty level of 145 for panel A, and 144 and 145 for panel B, which were within
1 point of the average of the full-length test, and resulted in estimated MPS values that were no
more than 1 point different from the full-length test MPS values (Table 6). Therefore, all the
estimated MPSs using 50% of the test items were found to have no statistically significant difference
(p > 0.05) from, and were statistically equivalent to, the MPS using the full-length test.

As the estimated MPS values based on the stratified random sampling of 50% of the test
questions generated systematic and stable results, our next concern was to examine whether
these stable results were due to the particular item selection method. Therefore, a sample of
50% of the items was drawn from each of the data sets using a simple random sampling method.
The estimates of the MPSs obtained from the 50% simple random samples had a maximum
difference of 7 points from the respective MPSs of the full-length tests. The Medical Health 1995 data
resulted in an estimated MPS of 72 (compared to 72 from the full-length test) and the 2000
Medical Health data yielded an MPS estimate of 84 (compared to 86 for the full-length test) (Table
5). However, the average difficulty levels of the random samples differed by 1 point and 4 points
from the average difficulty levels of the full-length tests for 1995 and 2000, respectively (Table 5).
For the Financial Analyst data, the estimated MPS values for panels A and B were 164 (compared
to 157 from the full-length test) and 148 (compared to 143 from the full-length test), respectively (Table
6). For panels A and B, the average difficulty levels of the random samples differed by 3 points
and 2 points from the average difficulty level of the full-length test, respectively (Table 6).
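
One way to explore the sensitivity to the selection method is to repeat both kinds of draw many times and compare how much the projected sum of p-values varies. The sketch below does this on synthetic p-values, not the study data; the helper names, the two-decimal p-value assumption, the per-stratum rounding rule, and the projection onto the full-length scale are all assumptions made for illustration.

```python
import random
from collections import defaultdict
from statistics import pstdev

def stratum(p_value):
    """Difficulty category 1..10, assuming p-values reported to two decimals."""
    return max(1, (round(p_value * 100) + 9) // 10)

def simple_sample(p_values, k, rng):
    """Simple random sample of k item indices."""
    return rng.sample(range(len(p_values)), k)

def stratified_sample(p_values, fraction, rng):
    """Proportional random sample within each difficulty stratum."""
    strata = defaultdict(list)
    for i, p in enumerate(p_values):
        strata[stratum(p)].append(i)
    chosen = []
    for s in sorted(strata):
        items = strata[s]
        chosen.extend(rng.sample(items, min(len(items), round(len(items) * fraction))))
    return chosen

def projected_difficulty(p_values, item_ids):
    """Sum of subset p-values, projected onto the full-length scale (assumed rule)."""
    return len(p_values) * sum(p_values[i] for i in item_ids) / len(item_ids)

rng = random.Random(2001)
# Synthetic 230-item test with p-values between 0.30 and 0.98.
p_values = [round(rng.uniform(0.30, 0.98), 2) for _ in range(230)]
reps = 500
strat = [projected_difficulty(p_values, stratified_sample(p_values, 0.5, rng))
         for _ in range(reps)]
simple = [projected_difficulty(p_values, simple_sample(p_values, 115, rng))
          for _ in range(reps)]
print(f"full-length sum of p-values: {sum(p_values):.1f}")
print(f"SD across stratified draws:  {pstdev(strat):.2f}")
print(f"SD across simple draws:      {pstdev(simple):.2f}")
```

In general one would expect the stratified draws to vary less, because each difficulty band is represented in proportion, but the size of the gap depends on how p-values are distributed in the actual test.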

Discussion 

This study addressed the possibilities of saving time and resources in an Angoff standard 
setting study. This study showed that using subsets of items as opposed to the full-length test 
could be an important consideration that could reduce the time and resources necessary to 
conduct a standard setting study. Reducing the number of items would result in savings of time 
and resources when conducting a standard setting activity, in particular during ratings of the 
items in each round (in the case of a multiple round standard setting study), disseminating 
feedback data, and entering data between rounds. Moreover, the judges who participate in this 
activity of setting passing scores are often practitioners in their profession who may have to miss 
business opportunities or close their offices when they participate in the standard setting study. 
Therefore, serving a longer time in the standard setting process not only costs the individual 
panelists but also the organizations they are associated with. The results of this study suggest 
that a stratified random sample of 50 percent of test items may be sufficient to estimate an 
equivalent MPS in an Angoff standard setting study. This could result in a substantial saving of time and
resources, not only for the agency carrying out this activity but also for the
practitioners (panelists) who participate in the standard setting study.

It is important to keep in mind when interpreting the results that the test items in the
subsets were selected based on the actual difficulty level of the items, and in proportion to the
total number of items within the pre-specified item difficulty-level categories. Another important
point to consider is that a stratified random sampling technique was used for selecting the items
in the subsets. The results of the study were found to be sensitive to the item selection method.
As an alternative to stratified random selection, the study used a simple random sampling
method to examine the sensitivity of the results. The resulting estimates were found to be
unstable and barely equivalent to the MPS for the full-length test.

Sireci et al. (2000) carried out a study on a standard setting methodology with computer
adaptive tests. The study indicated that MPSs derived using only two-thirds of the items
composing a CAT item pool were very similar to MPSs derived using the entire item bank.
However, the study involved only a single panel of experts and evaluated the method using a
single test. The present study yields much stronger results than the study conducted by Sireci et
al. (2000) because only one half of the test items were needed to reach equivalence. Moreover,
this study used multiple data sources to examine the stability of the results across different
subject areas and occasions. Therefore, the results may be generalized within and across different
content areas.

The results of this study were limited by the fact that the classifications of the actual
items within the table of specifications were not available. Therefore, only a one-stage stratified
random sampling was used to select the test items for the subsets, using only item difficulty level
categories as the strata. The precision of the estimate is a function of the number of
strata, i.e., more strata result in more precise estimates. Therefore, if item classification by the
table of specifications had been available, a two-stage stratified sampling method could have
been used to examine the precision of the estimates. Such a strategy might have allowed the use
of even fewer than 50% of the items to yield results equivalent to those from the full-length test.

Future research should focus on examining the generalizability of these results and the
conditions that supported the close proximity of the MPS values for the 50% subsets to those of the
full-length tests. More research should be done to investigate the sensitivity of the results to test
item selection techniques. There are other factors, specific to particular standard setting
situations, that could reduce time and resources both for agencies conducting this activity and for the
panelists who are participating in this activity. These should be researched along with the effects of
reducing the number of test items.

The purpose of a standard setting study is to make the most accurate and defensible
prediction of a minimum passing score possible. Important decisions are made based on these
passing scores. No decision that is unfair to the candidates’ future should be made based on
these passing scores. These decisions can influence the candidates’ livelihoods, especially in the
licensure and credentialing field. A great deal of time and money is spent on the process of
setting these passing scores. One aim of this study was to save time and resources, but not at
the cost of setting inaccurate passing scores and doing injustice to the candidates’ future. So, if a
justifiable, defensible, and accurate passing score can be set while still reducing time and
resources, this would be a highly desirable outcome. The need to address the issues of reducing
time and resources both for the agencies and for the panelists in a standard setting study is of great
importance. The results of this study suggest that it may be feasible to set passing scores with
the Angoff method using a subset of items from the full-length test.
Table 1.

Comparison of the Minimum Passing Scores: The Medical Health Study, 1995

                                    No. of items    Estimated MPS    Absolute difference
Full-length test                        110            72 (83)
Percentage of total test items in the subsets
  5%                                      6            75 (86)            3 (3)
  10%                                    11            74 (86)            2 (3)
  20%                                    21            73 (84)            1 (1)
  30%                                    34            70 (82)            2 (1)
  40%                                    44            72 (83)            0 (0)
  50%                                    55            73 (82)            1 (1)
  60%                                    66            72 (83)            0 (0)
  70%                                    78            72 (83)            0 (0)

Note. Numbers in parentheses are the sum of p-values and their absolute differences due to
sampling. None of the minimum passing scores (MPS) are statistically different from the MPS of
the full-length test at p = 0.05.
Table 2.

Comparison of the Minimum Passing Scores: The Medical Health Study, 2000

                                    No. of items    Estimated MPS    Absolute difference
Full-length test                        110            86 (91)
Percentage of total test items in the subsets
  5%                                      6            89 (94)            3 (3)
  10%                                    11            84 (90)            2 (1)
  20%                                    21            85 (90)            1 (1)
  30%                                    34            87 (91)            1 (0)
  40%                                    44            86 (90)            0 (1)
  50%                                    55            86 (90)            0 (1)
  60%                                    66            86 (90)            0 (1)
  70%                                    78            86 (91)            0 (0)

Note. Numbers in parentheses are the sum of p-values and their absolute differences due to
sampling. None of the minimum passing scores (MPS) are statistically different from the MPS of
the full-length test at p = 0.05.
Table 3.

Comparison of the Minimum Passing Scores: Panel A, the Financial Analyst Study, 2001

                                    No. of items    Estimated MPS    Absolute difference
Full-length test                        230           157 (144)
Percentage of total test items in the subsets
  5%                                     11           184 (151)          27* (7)
  10%                                    22           162 (141)           5 (3)
  20%                                    47           164 (144)           7 (0)
  30%                                    69           158 (144)           1 (0)
  40%                                    92           160 (144)           3 (0)
  50%                                   115           157 (144)           0 (0)
  60%                                   138           156 (144)           1 (0)
  70%                                   162           157 (144)           0 (0)

Note. Numbers in parentheses are the sum of p-values and their absolute differences due to
sampling. The minimum passing score (MPS) obtained from the subset with 5% of items (marked *) is
statistically different from the MPS of the full-length test at p = 0.05.
Table 4.

Comparison of the Minimum Passing Scores: Panel B, the Financial Analyst Study, 2001

                                    No. of items    Estimated MPS    Absolute difference
Full-length test                        230           143 (144)
Percentage of total test items in the subsets
  5%                                     11           145 (151)           2 (7)
  10%                                    22           148 (142)           5 (2)
  20%                                    47           139 (144)           4 (0)
  30%                                    69           145 (144)           2 (0)
  40%                                    92           143 (144)           0 (0)
  50%                                   115           145 (145)           2 (1)
  60%                                   138           145 (143)           2 (1)
  70%                                   162           143 (144)           0 (0)

Note. Numbers in parentheses are the sum of p-values and their absolute differences due to
sampling. None of the minimum passing scores (MPS) are statistically different from the MPS of
the full-length test at p = 0.05.
Table 5.

Comparison of the Minimum Passing Scores with Repeated Samples: The Medical Health
Studies, 1995 and 2000

                                           No. of items    Estimated MPS    Absolute difference
Medical Health, 1995 (full-length test)        110            72 (83)              -
  Stratified Random Samples
    Subset with 50% items                       55            73 (82)            1 (1)
    Repeated sample 1                           55            71 (82)            1 (1)
    Repeated sample 2                           55            72 (82)            0 (1)
  Simple random sample                          55            72 (82)            0 (1)
Medical Health, 2000 (full-length test)        110            86 (91)              -
  Stratified Random Samples
    Subset with 50% items                       55            86 (90)            0 (1)
    Repeated sample 1                           55            85 (90)            1 (1)
    Repeated sample 2                           55            85 (90)            1 (1)
  Simple random sample                          55            84 (87)            2 (4)

Note. Numbers in the parentheses are the sum of p-values and their absolute differences due to
the sampling.
Table 6.

Comparison of the Minimum Passing Scores with Repeated Samples: The Financial Analyst
Study, 2001

                                           No. of items    Estimated MPS    Absolute difference
Financial Analyst
Panel A (full-length test)                     230           157 (144)             -
  Stratified Random Samples
    Subset with 50% items                      115           157 (144)           0 (0)
    Repeated sample 1                          115           158 (145)           1 (1)
    Repeated sample 2                          115           157 (145)           0 (1)
  Simple random sample                         115           164 (147)           7 (3)
Panel B (full-length test)                     230           143 (144)             -
  Stratified Random Samples
    Subset with 50% items                      115           145 (145)           2 (1)
    Repeated sample 1                          115           144 (144)           1 (0)
    Repeated sample 2                          115           142 (145)           1 (1)
  Simple random sample                         115           148 (146)           5 (2)

Note. Numbers in the parentheses are the sum of p-values and their absolute differences due to
the sampling.